ABSTRACT

This paper examines the impact of multilingual (ML) acoustic
representations on Automatic Speech Recognition (ASR) and keyword
search (KWS) for low resource languages in the context of the
OpenKWS15 evaluation of the IARPA Babel program. The task
is to develop Swahili ASR and KWS systems within two weeks
using as little as 3 hours of transcribed data. Multilingual acoustic
representations proved to be crucial for building these systems under
strict time constraints. The paper discusses several key insights on
how these representations are derived and used. First, we present
a data sampling strategy that can speed up the training of multilingual
representations without appreciable loss in ASR performance.
Second, we show that fusion of diverse multilingual representations
developed at different LORELEI sites yields substantial ASR and
KWS gains. Speaker adaptation and data augmentation of these
representations improve both ASR and KWS performance (up to
8.7% relative). Third, incorporating untranscribed data through
semi-supervised learning improves WER and KWS performance.
Finally, we show that these multilingual representations significantly
improve ASR and KWS performance (relative 9% for WER and 5%
for MTWV) even when forty hours of transcribed audio in the target
language is available. Multilingual representations significantly
contributed to the LORELEI KWS systems winning the OpenKWS15
evaluation.

Index Terms: Multilingual Representation, Hierarchical Deep
Neural Network, Keyword Search, BABEL

1.  INTRODUCTION 

Multilingual (ML) models have been shown to outperform unilingual
models for ASR in low resource languages [1-8]. Recently,
ML models have also exhibited great advantages in keyword search
(KWS) tasks such as Babel [9-12] and the Spoken Web Search Task
held as part of the MediaEval Benchmark [13]. This paper focuses on
the impact of different ML representations on IBM's speech recognition
and keyword search systems used in the Babel Optional Period 2
(OP2) surprise language evaluation. These ML features were
independently developed at RWTH Aachen (RWTH) [11, 14, 15],
Cambridge University (CUED) [16, 17] and IBM.

Multilingual ASR has been investigated over the last two
decades. The approaches can be broadly classified into two main
categories: one focused on generating universal language-independent
lexicons, and the second focused on language-independent acoustic
representations. With the recent success of Deep Neural Networks
(DNNs), and their ability to generalize and learn useful acoustic
representations of languages, focus has shifted to using DNN-derived
multilingual representations. The approach that we take in this
work belongs to the second category. The advantage of using ML
features for a time-limited evaluation like BABEL is that the ML
features themselves can be trained in advance of the evaluation
period. However, given the large amount of data (approximately 1000
hours spanning 10+ languages) used for ML training, this process
can be very time consuming.

Several methods have been proposed in the literature to speed
up training of neural networks. While an exhaustive review of such
methods is beyond the scope of this paper, we mention a few relevant
techniques here. Optimization techniques that parallelize training
across multiple machines have been explored for DNN training
[18-22]. However, these methods involve significant communication
costs. To reduce these communication costs, [23] proposed a 1-bit
quantization of gradients with nearly no loss in accuracy, while [24]
proposed a combined hardware/software solution. An alternative
approach uses data sampling to speed up training. [25] presents a
methodology for using varying sample sizes in batch optimization
methods for large scale machine learning problems. The authors
propose a criterion for dynamic sample selection in the evaluation of
the function and gradient, based on variance estimates obtained
during the computation of a batch gradient.

In deriving multilingual representations, there have been studies
focused on carefully identifying a subset of language(s) closest to
the target language [26, 27], with subsequent use of data from this
subset only for training the network. Not only do these networks
offer performance improvements, they also train faster because they
use less data. In this paper, we present a data sampling strategy that
allows the network to see the training data across all languages in
stages. We show that this can shorten the training time to one third
of the original time with less than a 1% relative loss in speech
recognition performance.

Next, we present the impact of various input features used to derive
multilingual representations. The IBM ML representations are
derived from the log-Mel filterbank spectrum. In contrast, our partner
sites use alternate input features, such as gamma-tone features and
RASTA PLP [28,29]. We demonstrate that the performance of all these
multilingual representations is comparable regardless of the input
space.

We also revisit techniques which have proven to be helpful for
speech recognition and keyword search in previous years of the BABEL
program. These include re-alignment during training, semi-supervised
learning (SSL), and data augmentation. In this work, these
techniques are re-examined in the context of ML representations.

The rest of the paper is organized as follows. Section 2 describes
the IBM ML framework, analyzing the proposed data sampling strategy
with respect to accuracy and training speed-ups. Section 3 briefly
introduces the Babel OP2 task and its three conditions, as well as
the IBM ASR and KWS systems. Section 4 presents the recognition
performance of diverse ML representations for the two low resource
conditions studied. A comparison of unilingual models, SSL models
and speaker-adapted models is also presented. Section 5 presents
our preliminary results on adapting the ML features to the target
language. Section 6 demonstrates the impact of the ML representations
on keyword search performance. Finally, we analyze the use
of ML representations in KWS for both low-resource (3 hours) and
medium-resource (40 hours) conditions. The paper concludes with
key messages in Section 7.

2.  A  SIMPLE  HIERARCHICAL  MULTILINGUAL  MODEL 
WITH  DATA  SAMPLING 

The neural network architecture presented in this paper is hierarchical
and modeled after the topology proposed in [11]. It combines the
multilingual training strategy from [30] with the stacked DNN structure
from [31]. The hierarchical DNN based model with ML representations
is illustrated in Figure 1. The two DNNs in this stacked
architecture have a similar structure, with 5 layers of 1024
sigmoid units each, except for the bottleneck layer, which has 80
sigmoid units, and a final soft-max layer.

As illustrated in Fig. 1, the input to the first DNN consists of
40-dimensional log-mel filter bank features spliced together with a
context of +/-5 frames. The second DNN uses the 80-dimensional
bottleneck features extracted from the first DNN. The context is
expanded to include 10 frames on each side and then subsampled at a
five-frame interval to produce a 400-dimensional input vector for the
second DNN. Both DNNs use independent softmax output layers,
one for each of the 10 training languages used: Assamese,
Bengali, Pashto, Turkish, Tagalog, Vietnamese, Haitian Creole, Lao,
Tamil and Zulu. These languages cover the languages used in the
Base and OP1 evaluation periods of the Babel program [9]. We
used the development data from the Assamese language as the held-out
set to determine the stopping criterion for training this multilingual
network. While it is possible to have a fully connected final
layer across the targets of all languages, we chose this representation
to allow for faster training of the network resulting from fewer
parameters in the last layer. All hidden layers are shared across all
languages, allowing the network to learn a truly multilingual
representation. The output targets for each of the languages are the
context-dependent states derived from unilingually trained,
speaker-adapted decision trees. However, given that these languages
were processed during different phases of the program, we simply
reused the states that were generated then, with alignments generated
from either GMMs or DNNs.
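
The context splicing described above can be illustrated with a short NumPy sketch. This is only an illustration under stated assumptions: the function name `splice` is hypothetical, random arrays stand in for real features, and utterance edges are padded by repeating the first or last frame, which the paper does not specify.

```python
import numpy as np

def splice(frames: np.ndarray, offsets) -> np.ndarray:
    """Stack each frame with its neighbours at the given frame offsets.

    frames: (T, D) feature matrix. Edges are handled by repeating the
    first/last frame (an assumption, not taken from the paper).
    Returns a (T, D * len(offsets)) matrix.
    """
    num_frames = frames.shape[0]
    columns = []
    for off in offsets:
        idx = np.clip(np.arange(num_frames) + off, 0, num_frames - 1)
        columns.append(frames[idx])
    return np.concatenate(columns, axis=1)

# First DNN: 40-dim log-mel features with +/-5 frames of context (11 frames).
logmel = np.random.default_rng(0).standard_normal((200, 40))
dnn1_input = splice(logmel, range(-5, 6))

# Second DNN: 80-dim bottleneck features, +/-10 frames of context
# subsampled every 5 frames, i.e. offsets -10, -5, 0, 5, 10.
bottleneck = np.random.default_rng(1).standard_normal((200, 80))
dnn2_input = splice(bottleneck, range(-10, 11, 5))

print(dnn1_input.shape, dnn2_input.shape)
```

With five retained offsets of 80 dimensions each, the second input has 5 x 80 = 400 dimensions, matching the figure quoted in the text.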

Our multilingual representations on the target language are derived
from the bottleneck layer of the second DNN shown in Fig. 1,
by passing the target language through the multilingual network. In
this paper, we focus on ML representations obtained with no fine-tuning
on the target language, i.e., the multilingual network does not
see any of the target language data.


Fig.  1.  IBM  Hierarchical  Multilingual  DNN. 

2.1.  Data  Sampling 

In order to speed up training of ML representations, we propose a
data sampling strategy that allows the network to see only a fraction
of the data in each epoch. With several training epochs, the network
eventually sees all of the training data.

The training data from each language is organized into 30 sets,
with each set comprising several mini-batches. During each training
epoch, for each language, a fraction r of the training data is
used. It is possible for some sets to be used more often than others
during the training process. The model is trained with 15 epochs of
Stochastic Gradient Descent (SGD) on a single NVIDIA K40x GPU.
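
A minimal sketch of one way such a schedule could be drawn is given below. It assumes that the sets for each language are sampled uniformly without replacement within an epoch and independently across epochs; the paper does not state the exact selection rule, and the function names are hypothetical. Under this assumption some sets are naturally used more often than others, and full coverage of the data is likely but not guaranteed.

```python
import random

def sample_epoch_sets(num_sets: int, r: float, rng: random.Random) -> list:
    """Indices of the data sets of one language used in a single epoch."""
    k = max(1, round(r * num_sets))
    return sorted(rng.sample(range(num_sets), k))

def epoch_schedule(languages, num_sets=30, r=1 / 6, epochs=15, seed=0):
    """One dict per epoch, mapping each language to its sampled set indices."""
    rng = random.Random(seed)
    return [
        {lang: sample_epoch_sets(num_sets, r, rng) for lang in languages}
        for _ in range(epochs)
    ]

schedule = epoch_schedule(["Assamese", "Bengali", "Zulu"])
for epoch, sets in enumerate(schedule[:2]):
    print(epoch, sets)
```

With r = 1/6 and 30 sets, each language contributes 5 sets per epoch, so an epoch processes roughly one sixth of the data.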

Table 1 lists the converged cross-entropy objective function values
on the held-out set, and the corresponding times to train the
networks for different sampling ratios r. We explore different sampling
ratios for training both DNNs in the hierarchical architecture. When
a sampling ratio of 1/6 is used for both DNNs (IBM1), i.e., both
networks train on only one-sixth of the overall training data during
each epoch, the fastest training time of four days is observed. A
slightly better convergence is obtained when the second network in
the stacked architecture is allowed to train on half the training data
per epoch (IBM2). However, the training time increases by 50%
relative, from four to six days. Using the entire training set per
epoch for training the second net (IBM3) decreases the objective
slightly further, but at the expense of a much longer training period
(10 days). If the entire training data (IBM) is used at every epoch to
train both nets, the training time increases drastically to three weeks.
However, the objective function value achieved with all the data being
used for both nets is only slightly better (1.12) than that obtained
using half the training data for the second DNN and 1/6th for the
first net (1.17). Our experiments also showed that a single net is only
able to bring the objective down to 1.57 when trained with 1/6th of
the data per epoch. Even when using the entire training data per
epoch, only a very small improvement in objective, to 1.47, can be
achieved, while the training time increases drastically by a factor of
three.

As a sanity check, we compared the objective values presented
in Table 1 on the held-out set against the converged objective from
a unilingual DNN model (1.23) trained with the FLP data from the
language of the held-out set, i.e., Assamese. The hierarchical
multilingual net converges to a better objective (1.12) than the baseline.

DNN.1             DNN.2      XENT    Training time (Days)
r = 1/6 (IBM1)    r = 1/6    1.21    4
r = 1/6 (IBM2)    r = 1/2    1.17    6
r = 1/6 (IBM3)    r = 1      1.15    10
r = 1 (IBM)       r = 1      1.12    21

Table 1. ML training with various sampling ratios.


3.  IARPA  BABEL  OP2  SURPRISE  LANGUAGE 
EVALUATION 

The work reported in this paper is focused on the IARPA Babel OP2
surprise language (Swahili) evaluation. There are three evaluation
scenarios based on the amount of transcribed training data, namely,
Very Limited Language Pack (VLLP), Active Language Pack (ALP)
and Full Language Pack (FLP). In the VLLP case, the training data
comprises only 3 hours of transcribed audio. Participants have
only two weeks to train ASR and KWS systems. In the ALP case,
one hour of transcribed audio is provided initially. However,
participants are allowed to automatically select two additional hours of
audio using the one-hour set as a seed, for which manual transcripts
will be provided [32-37]. Both VLLP and ALP allow the use of
40 hours of untranscribed audio. The VLLP and ALP conditions serve
as two different methods to arrive at the best performing ASR and
KWS systems when very little transcribed data is available. The
algorithms developed for these two conditions are contrasted with the
FLP scenario, wherein 40 hours of transcribed audio is available.
The evaluation period runs for a week, with 70 hours of audio to be
searched in all three scenarios. The keywords are the same across all
conditions. Textual data used to derive lexicons and language models
comes from webcrawls [38] and is common to all participants.
For the evaluation itself, ML representations and use of webcrawls
are allowed for the VLLP and ALP conditions only. However, for this
study, in order to obtain a better understanding of ML representations,
we compare their value across all three conditions.

The VLLP training data contains 3K sentences (28K words)
with a vocabulary size of 5K. In contrast, the FLP training data
contains 50K sentences (353K words) with a vocabulary size of 24K.
The development data comprises 15 hours of audio (11K sentences)
and 4K query terms [39] and is the same across all three
conditions. An internal tuning set of 3 hours of audio (3.5K sentences)
was used to tune the hyperparameters of the ASR and KWS systems.

The ALP evaluation condition involves selecting two hours'
worth of segments from the untranscribed pool for transcription, and
then building models given the initial one hour plus the additional
two hours of transcribed audio. The untranscribed pool is initially
segmented at silence regions, followed by an entropy-based selection
of segments. The entropy for each segment is computed using a
grapheme probability density function computed over the consensus
network. Segments are selected in a round-robin fashion by speaker:
as we rotate through speakers, the segment having the highest
grapheme-based entropy for the current speaker is chosen.
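
The round-robin selection can be sketched as follows. This is a simplified illustration, not the evaluation system: it assumes each segment already carries a grapheme probability distribution (in the paper this is derived from the consensus network), uses plain Shannon entropy as the score, and stops once a duration budget is reached. All field and function names are hypothetical.

```python
import math
from collections import defaultdict

def grapheme_entropy(probs):
    """Shannon entropy (in nats) of a grapheme distribution for one segment."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def select_segments(segments, budget_seconds):
    """Rotate through speakers, taking each speaker's highest-entropy segment.

    segments: list of dicts with keys 'speaker', 'duration', 'grapheme_probs'.
    """
    by_speaker = defaultdict(list)
    for seg in segments:
        by_speaker[seg["speaker"]].append(seg)
    for queue in by_speaker.values():
        # Highest-entropy segment first within each speaker.
        queue.sort(key=lambda s: grapheme_entropy(s["grapheme_probs"]), reverse=True)

    selected, total = [], 0.0
    while total < budget_seconds and any(by_speaker.values()):
        for speaker in list(by_speaker):
            queue = by_speaker[speaker]
            if not queue:
                continue
            seg = queue.pop(0)
            selected.append(seg)
            total += seg["duration"]
            if total >= budget_seconds:
                break
    return selected

pool = [
    {"speaker": "A", "duration": 10.0, "grapheme_probs": [0.5, 0.5]},
    {"speaker": "A", "duration": 10.0, "grapheme_probs": [0.9, 0.1]},
    {"speaker": "B", "duration": 10.0, "grapheme_probs": [1.0]},
    {"speaker": "B", "duration": 10.0, "grapheme_probs": [0.25] * 4},
]
picked = select_segments(pool, budget_seconds=30.0)
print([(s["speaker"], round(grapheme_entropy(s["grapheme_probs"]), 3)) for s in picked])
```

Rotating by speaker keeps the selected two hours from being dominated by a few speakers whose segments happen to have high entropy.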

The  analyses  of  ASR  and  KWS  systems  reported  throughout  this 
paper  are  on  the  tuning  set.  The  final  tuned  systems  are  then  evalu¬ 
ated  on  the  development  set  (Section  6). 

3.1.  IBM  ASR  System 

The baseline speaker-independent (SI) acoustic model used in IBM's
ASR system is described below. The input features are 13-dimensional
PLP features with speaker-based mean and variance normalization.
A context of 9 frames is spliced together and projected to a
40-dimensional feature space using linear discriminant analysis (LDA),
and the class-conditional distributions are further diagonalized using
a global semi-tied covariance (STC) transform.

In the SI ML pipeline, the above PLP+LDA+STC features are
fused with ML features, transformed by LDA and STC, and then
used as input for a two-fold DNN pipeline. In each fold, a new
alignment is generated with the current model and a new decision
tree is built on top of the alignment. The input features are of the
same type in both folds. The DNN training procedure comprises
(1) discriminative layer-wise pretraining [40], (2) training with the
cross-entropy criterion and (3) training with the state-level minimum
Bayes risk (sMBR) criterion [19,20]. The DNN comprises 3 hidden
layers of 1024 ReLU units, followed by one 1024-unit sigmoid
layer and a 128-unit linear layer. In the speaker-adapted (SA) ML
systems, ML features spanning a context of 9 frames are spliced
together and projected down to a 40-dimensional feature space with
LDA+STC, followed by a constrained MLLR transform [41]. All DNN
models used in this paper are hybrid models [20]. The IBM Attila
speech recognition toolkit [42] is used for training the models
presented in this paper.

The baseline language models (LM) are Kneser-Ney (KN)-smoothed
bigram models. For the FLP condition, the vocabulary size
was 24K and for the VLLP/ALP conditions, the vocabulary size
was 5K.

3.2.  On-the-fly  Lattice  KWS 

We use an on-the-fly version of lattice based keyword search to
generate our KWS results. The queries are read in and processed to
create query Finite State Transducers (FSTs). In-vocabulary (IV)
queries are represented at both the token (word or syllable) level and
the grapheme level, and query expansion is applied to the grapheme
FSTs using a confusability model. OOV queries are only represented
at the grapheme level, and have the same degree of query expansion
as the IV queries. Next, as ASR is performed for each segment, a
lattice is generated in memory and converted to a weighted FST index;
the queries are searched for in the index via composition, and any hits
are recorded. When ASR is finished, the results are written to disk
in the form of postings lists for the token and grapheme searches.
Finally, the postings lists are merged in a cascaded fashion: if any
token results for a query are present, they are used; otherwise, the
grapheme results are used. It is important to note that the output
of this on-the-fly KWS is identical to what would be produced by
a standard FST based KWS system [43,44] that writes out lattices,
compiles them into indexes, and then runs search. The primary
advantage of the on-the-fly approach is that we avoid the need to write
out the lattices, which can be extremely large.
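
The cascaded merge of the two postings lists amounts to a per-query fallback, which the following sketch illustrates. The data layout (dictionaries mapping a query to a list of hits) and the function name are assumptions made for illustration only.

```python
def merge_postings(token_hits, grapheme_hits):
    """Cascaded merge: use token-level hits if a query has any, else grapheme hits."""
    merged = {}
    for query in sorted(set(token_hits) | set(grapheme_hits)):
        hits = token_hits.get(query)
        if not hits:  # no token-level result for this query
            hits = grapheme_hits.get(query, [])
        merged[query] = hits
    return merged

# Hypothetical hits as (segment_id, start_time_sec, score) tuples.
token = {"habari": [("seg1", 3.2, 0.91)], "safari": []}
grapheme = {"habari": [("seg7", 0.5, 0.40)], "safari": [("seg2", 1.1, 0.63)]}
merged = merge_postings(token, grapheme)
print(merged)
```

Because OOV queries have no token-level representation, they always fall through to the grapheme results in this scheme.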

We used cleaned webcrawls to augment the ASR dictionary,
vocabulary and LM for KWS. After the addition of webcrawls, the
vocabulary size of the language models increases to roughly 350K for all
three conditions. This reduced the OOV rate of KWS queries by
nearly 76% relative on the VLLP and the ALP data sets and 64% on
the FLP set. The KWS results presented in this paper are based on
word lattices and query expansion of 1000-nbest applied for phonetic
search. The performance of the KWS system is measured using
the Maximum Term Weighted Value (MTWV) metric described in
[45].
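
For reference, the term-weighted value as it is commonly defined in the NIST keyword search evaluations can be written as below. The notation is ours rather than this paper's: Q is the set of query terms, theta the decision threshold, N_true(q) the number of true occurrences of q, N_corr and N_FA the numbers of correct and spurious detections, and T the duration of the searched audio in seconds. The weight beta is a fixed constant of the evaluation, typically 999.9 in the NIST setup; this value is recalled from that definition and is not stated in this paper.

```latex
\begin{align}
P_{\mathrm{miss}}(q,\theta) &= 1 - \frac{N_{\mathrm{corr}}(q,\theta)}{N_{\mathrm{true}}(q)}, &
P_{\mathrm{FA}}(q,\theta) &= \frac{N_{\mathrm{FA}}(q,\theta)}{T - N_{\mathrm{true}}(q)}, \\
\mathrm{TWV}(\theta) &= 1 - \frac{1}{|Q|} \sum_{q \in Q}
  \left[ P_{\mathrm{miss}}(q,\theta) + \beta\, P_{\mathrm{FA}}(q,\theta) \right], &
\mathrm{MTWV} &= \max_{\theta}\, \mathrm{TWV}(\theta).
\end{align}
```

A perfect system scores 1, and a system that returns nothing scores 0, so MTWV values such as those reported in the tables are directly comparable across conditions that share the same keyword list.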

4.  MULTILINGUAL  REPRESENTATIONS  FOR  SWAHILI 

This section describes multilingual representations for Swahili ASR
and KWS, and details the experiments with ML representations
from CUED and RWTH. The DNN used to derive the CUED
ML features is similar to [11]. The input features are 24-dimensional
log Mel magnitude spectrum filter banks, pitch, probability of voicing,
and their derivatives. The RWTH ML features are described
in [14] and include long-span features. Both the RWTH and CUED
multilingual networks are trained on 11 languages.

4.1.  ML  Representations  in  Low  Resource  Conditions 

4.1.1.  Data  Sampling 

Based on the results in Table 1 and the evaluation time constraints,
we selected the ML representations IBM1 and IBM2 for further
exploration in the VLLP condition, and IBM2 and IBM for the ALP
condition. The evaluation for the three conditions was staged, with
ALP following VLLP. Ideally, we would have liked to use the IBM
ML representation from the last row of Table 1 for the VLLP condition
but could not do so due to the time constraints of the evaluation.

To illustrate the impact of ML representations derived from
different sampling ratios, we select the following configurations.
For the VLLP condition, the ML features are fused with SI
PLP+LDA+STC features and used as input features to train a
4-layer DNN. The targets for this DNN are 1000 context-dependent
states from a DNN-alignment based decision tree. For the ALP
condition, the ML features were speaker-adapted and fused with a
second set of ML representations obtained from RWTH. Table 2
presents the ASR systems' performance on the surprise language,
Swahili, when using these ML representations. It can be seen from
the table that, in both conditions, the multilingual nets yield better
hidden representations when they see more data in each training
epoch.


Condition    IBM1    IBM2    IBM
VLLP         65.1    64.3    -
ALP          -       60.3    59.8

Table 2. ASR performance with ML features generated from different
sampling ratios.

4.1.2.  VLLP 

In this section, we compare the ASR performance of four different
configurations that use SI and SA ML representations. Table 3 lists
WERs on the tuning set with different ML features at intermediate
training steps of the recipe presented in Section 3.1. In Table 3, 'XE'
refers to the training step with cross-entropy as the objective function
and 'sMBR' refers to the sequence training step. The two folds
of DNN training referred to in Section 3.1 are denoted by the suffixes
'1' and '2', respectively. First, we observe that as the models are
refined with re-alignments during the training steps, the ASR
performance improves (Rows 1 through 4), regardless of the
type of ML feature used. The gain in performance for each of the
ML features ranges from 2.5% to 4.0% absolute (Columns 1 through
3). The second observation concerns the complementarity of
different ML representations. The last column in Table 3, following
the same pipeline, uses fused/connected ML features as input: the
speaker-adapted IBM2 features from IBM and the SI ML features
from RWTH. A reduction in WER of 2.0% absolute is seen with
these combined features as well, as the models are refined during the
various stages of training, illustrating the complementarity of these
ML representations. It can also be seen that this type of gain holds
through the various intermediate training stages. Third, we observe
that re-alignment with the first set of models helps in decreasing the
WER further, yielding gains in the range of 0.4% to 1.9% absolute
(compare rows sMBR.1 and sMBR.2) across the different
configurations. Last, we observe that the ML representations from the
various sites are comparable and converge to more or less the same WER
(Row 4), with the ML features from RWTH outperforming the other
two ML features.


Stage     RWTH-SI   CUED-SI   IBM2-SI   IBM2-SA + RWTH
XE.1      65.7      66.9      68.3      63.3
sMBR.1    63.3      64.8      66.2      62.1
XE.2      63.5      65.3      66.1      62.2
sMBR.2    62.9      64.4      64.3      61.3
MTWV      0.4102    0.4197    -         0.4783

Table 3. Performance of ML features on Swahili VLLP.


4.1.3.  Semi-supervised  Learning 

Semi-supervised learning (SSL) has been shown to be beneficial to ASR
and keyword search in the Base and OP1 evaluation periods [46].
Motivated by these previous results, the untranscribed data is first
decoded with an initial VLLP model trained on just the transcribed
data, and subsequently merged with the transcribed data to form a
unified, larger training data set. The ML representations are derived
on this larger data set and used to train the final DNN on the target
language. With the addition of training data, the number of output
targets is increased from 1000 to 3000.

Table  4  illustrates  the  ASR  and  keyword  search  performance 
(WER/MTWV)  using  SI  ML  features  from  RWTH  and  CUED,  with 
and  without  SSL  training  under  the  VLLP  condition.  SSL  yields 
2.6%  to  3.4%  absolute  reduction  in  WER.  This  also  provides  a  6.3% 
to  8.7%  relative  increase  in  MTWV. 


            w/o SSL        w/ SSL
RWTH-SI     62.9/0.4102    60.3/0.4458
CUED-SI     64.4/0.4197    61.0/0.4461

Table 4. Comparison of Swahili VLLP performance (WER/MTWV)
with and without semi-supervised learning.
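
The absolute WER reductions and relative MTWV gains quoted for SSL follow directly from the entries of Table 4. The short check below recomputes them; only the numbers come from the table, while the helper names are ours.

```python
def absolute_reduction(before, after):
    """Absolute drop, e.g. in WER percentage points."""
    return before - after

def relative_increase(before, after):
    """Relative gain in percent, e.g. for MTWV."""
    return 100.0 * (after - before) / before

# (WER %, MTWV) without and with SSL, copied from Table 4.
table4 = {
    "RWTH-SI": ((62.9, 0.4102), (60.3, 0.4458)),
    "CUED-SI": ((64.4, 0.4197), (61.0, 0.4461)),
}

changes = {}
for system, ((wer0, mtwv0), (wer1, mtwv1)) in table4.items():
    changes[system] = (
        round(absolute_reduction(wer0, wer1), 1),
        round(relative_increase(mtwv0, mtwv1), 1),
    )
    print(system, changes[system])
```

The same two helpers can be applied to the other tables to distinguish absolute from relative figures, which are both used in this paper.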

Table 5 demonstrates the performance of SA ML representations
with SSL. The PLP-SA row refers to SSL applied to a DNN trained
on speaker-adapted PLP features and results in a WER of 62.4% on
the tuning set. RWTH-SA refers to a DNN (described in Section
4.1) trained on the speaker-adapted ML features from RWTH. This
model results in a WER of 60.0%. Fusion of these ML features with
the IBM2 ML representations from IBM provides a further reduction
in WER of 1.3% absolute. Addition of a third speaker-adapted ML
feature, from CUED (fourth row in the table), reduces the WER
by another 1.0% absolute. The three different ML representations
are clearly complementary, resulting in a 2.3% absolute reduction
in WER over using the best single ML representation. When
compared to a DNN trained with SI ML features from RWTH (see
Table 3), we observe that multiple ML features in conjunction with
SSL reduced WER by 5.2% absolute, from 62.9% to 57.7%. It is
interesting to note that an SI ML representation based DNN with
no SSL (62.9% from Table 4) nearly matches the performance of a DNN
trained with simple, speaker-adapted PLP features and SSL (62.4% in
Table 5). This implies that ML representations do capture acoustic
representations well, i.e., ML features from 3 hours of transcribed
data on the target language can achieve about the same level of ASR
performance as PLP features from 3 hours of transcribed data and
approximately 40 hours of untranscribed data. The last row in this
table refers to a data augmentation technique originally presented in
[47]. Here, the speaker-adapted ML features from RWTH are used and
the training data is increased 8-fold. Interestingly, this model yields
the same ASR performance (58.7%) as the fused RWTH-SA + IBM2-SA
representations, suggesting that ML representations do capture the
acoustics of the target language very well.


Features                          WER     MTWV
PLP-SA w/o ML                     62.4    0.4715
RWTH-SA                           60.0    0.4684
RWTH-SA + IBM2-SA                 58.7    0.4703
RWTH-SA + IBM2-SA + CUED-SA       57.7    0.4809
RWTH-SA + 8x data augmentation    58.7    -

Table 5. Impact of several ML features on Swahili VLLP with
semi-supervised learning.

4.1.4.  ALP 

Table 6 shows the comparison of various ML features used for the
ALP evaluation scenario. As mentioned in Section 2.1, the IBM
ML representation (IBM) is used for training a DNN on the target
language. The first three rows of the table present ASR performance
when three different SI ML features are used. Similar to the VLLP
evaluation condition, the different ML representations are very similar
in performance. Speaker-adaptive transformation applied to the
IBM ML features yields a reduction of 0.9% WER absolute. This
finding is consistent with the VLLP condition, where similar gains
were observed (see Table 3). Feature combination with RWTH ML
features gives an additional 1.1% reduction in WER, and SSL an
additional 2.4% reduction in WER. This is consistent with our previous
observation for the VLLP condition. The use of multiple ML
representations and SSL (last row) accounts for a 4% absolute reduction
in WER over a single ML feature, the RWTH-SI features (61.5% to
57.4%).


Models                  WER     MTWV
RWTH-SI                 61.5    -
CUED-SI                 63.4    -
IBM-SI                  61.8    0.4454
IBM-SA                  60.9    0.4669
IBM-SA + RWTH           59.8    0.4823
IBM-SA + SSL            58.4    -
IBM-SA + RWTH + SSL     57.4    0.4714

Table 6. Performance of ML features on Swahili ALP.


4.2.  ML  Features  on  Swahili  FLP 

In the earlier sections, we demonstrated the significant impact of ML
representations for the low-resource conditions. In this section, we
explore their use for the FLP scenario with 40 hours of transcribed
data. The baseline DNN is trained on speaker-independent PLP features
using the recipe outlined in Section 4.1 and yields a WER
of 50.9% on the development data set (see Table 7). The use of
speaker-adapted PLP features decreases the WER further to 49.0%.
The addition of IBM ML features (IBM) to the SA-PLP features
results in a significant reduction of WER by 4.2% absolute. This
strong result highlights the value of ML representations even when
40 hours of transcribed data is available in the target language.


Stages                      WER (%)
Baseline  NoML-SI           50.9
          NoML-SA           49.0
NoML-SA + IBM               44.8

Table  7.  Performance  of  ML  features  on  Swahili  FLP. 
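
The absolute and relative reductions quoted for the FLP condition can
be checked with a few lines of arithmetic. The sketch below is only
illustrative: the helper name wer_reduction is hypothetical, and the
WER values are copied from Table 7.

```python
def wer_reduction(baseline_wer, new_wer):
    """Return (absolute, relative %) WER reduction for WERs given in percent."""
    absolute = baseline_wer - new_wer
    relative = 100.0 * absolute / baseline_wer
    return absolute, relative

# WERs (%) on the development set, copied from Table 7.
wer = {"NoML-SI": 50.9, "NoML-SA": 49.0, "NoML-SA + IBM": 44.8}

abs_red, rel_red = wer_reduction(wer["NoML-SA"], wer["NoML-SA + IBM"])
print(f"absolute: {abs_red:.1f}, relative: {rel_red:.1f}%")
# prints "absolute: 4.2, relative: 8.6%"
```

The relative figure of roughly 9% matches the WER gain quoted for the
FLP condition in the abstract.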


5.  FINE  TUNING  OF  ML  FEATURES  ON  THE  TARGET 
LANGUAGE 

In this section, we investigate the value of refining the ML
representations with an additional training pass using the available
data from the target language. We use the ALP evaluation scenario for
this study. In the configuration presented here, the parameters of
the second DNN in the stacked architecture are adjusted on the target
language. The last layer of the second DNN is randomly initialized,
with the output targets set to the context-dependent states of the
target language. The remaining layers are initialized with the
weights obtained from the multilingual training.
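
The initialization described above can be sketched as follows. This
is a minimal illustration under stated assumptions, not the training
code used for the experiments: the function names, the toy layer
sizes, the Gaussian initialization and the representation of a layer
as a (weights, biases) pair of plain Python lists are all
hypothetical.

```python
import random

def init_for_fine_tuning(ml_layers, num_target_states, seed=0):
    """Copy all multilingual layers except the output layer, then append
    a randomly initialized output layer sized for the target language."""
    rng = random.Random(seed)
    # Hidden layers keep the weights from multilingual training.
    layers = [([row[:] for row in w], b[:]) for w, b in ml_layers[:-1]]
    fan_in = len(ml_layers[-1][0][0])  # width of the last hidden layer
    # New output layer: one row per context-dependent target state.
    w_out = [[rng.gauss(0.0, 0.01) for _ in range(fan_in)]
             for _ in range(num_target_states)]
    layers.append((w_out, [0.0] * num_target_states))
    return layers

def trainable_mask(num_layers, first_trainable):
    """Mark layers (1-based) from first_trainable onwards as updated;
    earlier layers stay frozen during fine-tuning."""
    return [index >= first_trainable for index in range(1, num_layers + 1)]

def random_layer(rng, n_out, n_in):
    """Toy stand-in for a layer obtained from multilingual training."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

rng = random.Random(1)
ml_layers = [random_layer(rng, 3, 4), random_layer(rng, 3, 3),
             random_layer(rng, 2, 3)]
layers = init_for_fine_tuning(ml_layers, num_target_states=5)
mask = trainable_mask(len(layers), first_trainable=2)
print(mask)  # [False, True, True]
```

Varying first_trainable gives the kind of configurations compared
below, from updating only the output layer to updating every layer.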

Table 8 presents the cross-entropy values for different training
configurations, each corresponding to a different set of DNN layers
being updated. The WERs in this table result from hybrid decoding
using the refined DNN directly. The first three rows correspond to
refining the weights of the last layer, the weights from the second
hidden layer onwards, and all layers of the DNN, respectively. A
significant reduction in the objective and in WER is obtained when
more layers of the network are tuned to the target language.


Layers   XENT   WER
5        3.63   69.2
3+       2.52   -
2+       2.54   62.5
all      2.56   -

Table  8.  Cross-entropy  and  WER  on  Swahili  ALP  after  fine-tuning. 

The network from which the ML representations are derived is
fine-tuned from the second hidden layer onwards. A DNN was then
trained on the three hours of transcribed ALP data using the recipe
in Section 4.1. Table 9 reports the WER on the tuning set at
intermediate training steps. Although the fine-tuned features
outperform the vanilla ML features in the early stages, the gains
gradually disappear in the subsequent training steps.


Stages   No finetune   Finetune
XE.1     65.5          64.4
sMBR.1   63.3          62.9
XE.2     63.0          62.8
sMBR.2   61.8          61.8

Table  9.  Comparison  of  ML  features  with  and  without  fine  tuning  on 
Swahili  ALP. 


6.  KEYWORD  SEARCH  ANALYSIS 

In this section, we analyze the impact of ML features on our KWS
results, obtained using the on-the-fly KWS described in Section 3.2.
We compare the MTWV achieved by systems trained with and without
multilingual features on all three BABEL conditions: ALP, VLLP and
FLP. The comparisons were done on the tuning set for the ALP and VLLP
conditions, and on the development set for the FLP condition. Table
10 provides the tuning set results for ALP and VLLP. We observe that
ML features give consistent gains on both the IV and OOV terms. We
obtain a relative MTWV improvement of 5.8% for ALP and 2% for VLLP.
The development set MTWV results for all three conditions are
reported in Table 11. We obtain a relative MTWV improvement of 4.2%
for ALP, 7.6% for VLLP, and 6.7% for FLP. Although the results
presented in Table 10 and Table 11 are obtained with word lattices,
similar trends hold for our morph and syllable-based KWS systems.
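
The relative improvements quoted above follow directly from the Total
MTWV columns of Tables 10 and 11. The short sketch below reproduces
them; the helper name relative_gain and the dictionary layout are
illustrative only.

```python
def relative_gain(with_ml, without_ml):
    """Relative MTWV improvement (%) of the system with ML features."""
    return 100.0 * (with_ml - without_ml) / without_ml

# (with ML features, without ML features), Total MTWV from Tables 10 and 11.
tuning = {"ALP": (0.4823, 0.4559), "VLLP": (0.4809, 0.4715)}
dev = {"ALP": (0.4946, 0.4745), "VLLP": (0.4957, 0.4605),
       "FLP": (0.5736, 0.5374)}

for name, table in (("tuning", tuning), ("dev", dev)):
    for condition, (yes, no) in table.items():
        print(f"{name} {condition}: {relative_gain(yes, no):.1f}%")
# tuning ALP: 5.8%, tuning VLLP: 2.0%
# dev ALP: 4.2%, dev VLLP: 7.6%, dev FLP: 6.7%
```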


                        MTWV
System   ML feats   IV       OOV      Total
ALP      Yes        0.4337   0.5747   0.4823
ALP      No         0.4078   0.5470   0.4559
VLLP     Yes        0.4335   0.5638   0.4809
VLLP     No         0.4105   0.5787   0.4715

Table  10.  Comparison  of  MTWV  between  a  multilingual  system  and 
a  unilingual  system  trained  only  on  Swahili  data  for  ALP  and  VLLP 
on  the  tuning  set.  The  table  also  includes  the  MTWV  breakdown 
for  IV  and  OOV  queries  defined  according  to  the  original  non-web 
vocabularies. 


                        MTWV
System   ML feats   IV       OOV      Total
ALP      Yes        0.4708   0.5283   0.4946
ALP      No         0.4490   0.5104   0.4745
VLLP     Yes        0.4870   0.5071   0.4957
VLLP     No         0.4430   0.4870   0.4605
FLP      Yes        0.5780   0.5100   0.5736
FLP      No         0.5413   0.4780   0.5374

Table 11. Comparison of development set MTWV between a multilingual
system and a system trained only on Swahili data for ALP, VLLP and
FLP conditions. The table also includes the MTWV breakdown for IV and
OOV queries.
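
For readers less familiar with the metric in Tables 10 and 11, the
sketch below shows how a term-weighted value can be computed. It
assumes the usual NIST OpenKWS formulation with beta = 999.9, which is
not restated in this paper. The function names and toy counts are
hypothetical. MTWV is the maximum of this value over decision
thresholds, while ATWV is the value at the threshold the system
actually used.

```python
def term_weighted_value(counts, speech_seconds, beta=999.9):
    """TWV at one threshold.

    counts maps each keyword to (n_true, n_correct, n_false_alarm).
    Keywords with no reference occurrences are skipped.
    """
    total_cost = 0.0
    n_terms = 0
    for n_true, n_correct, n_false_alarm in counts.values():
        if n_true == 0:
            continue
        p_miss = 1.0 - n_correct / n_true
        # One trial per second of speech that is not a true occurrence.
        p_fa = n_false_alarm / (speech_seconds - n_true)
        total_cost += p_miss + beta * p_fa
        n_terms += 1
    return 1.0 - total_cost / n_terms

def maximum_twv(counts_by_threshold, speech_seconds):
    """MTWV: the best TWV over all candidate decision thresholds."""
    return max(term_weighted_value(counts, speech_seconds)
               for counts in counts_by_threshold.values())

# Toy counts for two keywords at two thresholds over 10 hours of speech.
counts_by_threshold = {
    0.3: {"habari": (10, 9, 4), "asante": (5, 4, 2)},
    0.7: {"habari": (10, 7, 0), "asante": (5, 3, 0)},
}
mtwv = maximum_twv(counts_by_threshold, speech_seconds=36000.0)
print(f"MTWV = {mtwv:.4f}")
```

Because beta is large, a handful of false alarms on a rare keyword
can outweigh several additional correct detections, which is why the
threshold matters so much for this metric.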

Figure 2 shows the variation of MTWV with query length, measured in
number of graphemes, for systems with and without multilingual
features under the three conditions: FLP, VLLP and ALP. We observe
that multilingual features help bridge the gap between the data-rich
FLP and the data-sparse VLLP/ALP conditions, and give consistent KWS
performance gains for all three conditions. We also note that KWS
performance increases with query length, because short queries are
usually more acoustically confusable than longer ones.

Fig. 2. Variation of MTWV with query length (number of graphemes) for
the three conditions (FLP, VLLP and ALP) using systems with and
without multilingual features.

7.  CONCLUSIONS

Multilingual acoustic representations proved to be crucial for
building systems under the strict resource and time constraints of
the OpenKWS15 Evaluation. Using multilingual representations
significantly improved our ASR and KWS performance (relative 9% for
WER and 5% for MTWV). This paper presented our findings from the
process of building these systems, which can be summarized as
follows:

• The data sampling strategy presented in the paper can speed up the
training of multilingual representations without much loss in
performance.

• Fusion of diverse multilingual representations yields substantial
ASR and KWS gains.

• Fine-tuning the multilingual representations on the target language
(Swahili) did not improve performance.

• Speaker adaptation and data augmentation of these representations
improved word-error rate (WER) and KWS performance.

• Incorporating un-transcribed data through semi-supervised learning
improved WER and KWS performance.

• Multilingual features were helpful even when forty hours of
transcribed audio in the target language were available.

The final KWS submission to the OpenKWS15 evaluation was a
combination of multiple systems using multilingual representations
developed at IBM, RWTH, and CUED. It achieved an MTWV of 0.5888 for
VLLP and 0.6020 for ALP on the tuning set. These KWS systems yielded
the best ATWVs on the evaluation data across all three conditions
(ALP: 0.5952, VLLP: 0.5797 and FLP: 0.6548). Note that an ATWV of 0.3
is considered acceptable and is the program goal for Babel OP2.